Skip to content

Conversation

@JonathanC-ARM
Copy link
Contributor

Description

Implementation of special sgemm path which uses GEMV kernels in cases where M or N are 1

Additionally this pr introduces the usage of a microkernel interface which utilizes typedef's provided by KleidiAI such that we can simplify the code and remove things such as ternary operations for SME1 vs SME2 kernels

Indicative Performance

In Lieu of any production models where gemv was a large contributor of the network. I opted to create a mini model to test which contains thousands of randomized matmul variants. With a distribution of GEMV cases throughout
image (6)

Using onnxruntime perf test I was able to half the total inference time vs mlas with this model
ort_ops_compare_gemv_no_2025-10-07_19-40-30_vs_gemv_2025-10-07_19-40-58

More Benchmarks to come shortly

Signed-off-by: Jonathan Clohessy <[email protected]>
@hariharans29
Copy link
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Copy link
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Copy link
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

Comment on lines 30 to 31
kai_matmul_clamp_f32_f32p_f32p_ukernel sgemm_gemm = GetKleidiAISGemmUKernel();
kai_matmul_clamp_f32_f32_f32p_ukernel sgemm_gemv = GetKleidiAISGemvUKernel();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetKleidiAIXUKernel() returns const&. do we need to make a copy here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kai_matmul_clamp_f32_f32p_f32p_ukernel sgemm_gemm = GetKleidiAISGemmUKernel();
kai_matmul_clamp_f32_f32_f32p_ukernel sgemm_gemv = GetKleidiAISGemvUKernel();
const kai_matmul_clamp_f32_f32p_f32p_ukernel& sgemm_gemm = GetKleidiAISGemmUKernel();
const kai_matmul_clamp_f32_f32_f32p_ukernel& sgemm_gemv = GetKleidiAISGemvUKernel();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to const in the latest push

//No fallback and putting in guards
if(MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()){
ArmKleidiAI::MlasDynamicQGemmBatch(Shape, DataParams, BatchN, ThreadPool);
if(ArmKleidiAI::SMEInfo::CanUseSME2){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are other places that need to be updated, like:

if (!CPUIDInfo::GetCPUIDInfo().HasArm_SME()) {

if (!MLAS_CPUIDINFO::GetCPUIDInfo().HasArm_SME()) {

I might be missing some.

I think it would be worth making a helper function like MlasIsDynamicQGemmAvailable that has the appropriate checks and using that instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in the updated checks in various places like these in the latest push

@JonathanC-ARM JonathanC-ARM force-pushed the jclohess_kleidiai_gemv_implementation branch from a3f4f5b to e8ab1b1 Compare October 24, 2025 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants